Introduction
Within the City of New York, do the prevalence and handling of graffiti differ between boroughs? The answer to both questions is yes: prevalence differs in the number of cases per capita, and handling differs in the mean length of time to resolve a case, as shown by the t-tests we conducted between boroughs.
Background
The data set contains 21,303 observations, each of which is a reported instance of graffiti. Each observation records the borough and neighborhood the graffiti was reported in, the date the incident was filed, and whether the case is open or closed. If the case is closed, there is also a date for when the incident was closed. The data was collected from reports of graffiti made to the proper authorities and was published by the Department of Sanitation (DSNY).
One major anomaly in the data is the number of cases in which the incident is not closed. Of the 21,303 observations, 7,669 are still “ongoing”, including observations that were created over a year ago.
Our goal for the rest of the report is to compare the length of time from a case being opened to being closed across boroughs, and to test whether the average case length differs between boroughs with statistical significance. We also include graphics that illustrate these comparisons and highlight other areas of interest (e.g., graffiti per capita in each borough).
Analysis
First, we tidy and transform the data to fit the needs of our analysis: we create a column for the number of days it took to close each case and a column indicating whether the case is open or closed. Before turning to case length, we look at the rate of graffiti reports in each borough regardless of case status, that is, at prevalence alone.
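The tidying step described above might look roughly like the following. This is a minimal base R sketch on a three-row toy data frame: the name `graffiti` and the example dates are assumptions for illustration, while the column names `created`, `closed`, `length`, and `status` follow the tables later in the report. Here `length` is stored as a plain number of days.

```r
# Toy stand-in for the raw data; the name `graffiti` and these rows are hypothetical.
graffiti <- data.frame(
  borough = c("Bronx", "Brooklyn", "Queens"),
  created = as.Date(c("2019-01-07", "2018-12-18", "2019-03-07")),
  closed  = as.Date(c("2019-04-10", NA, "2019-06-01"))
)

# Days from filing to closing; NA while a case is still open.
graffiti$length <- as.numeric(
  difftime(graffiti$closed, graffiti$created, units = "days")
)

# A case with no closing date is treated as open.
graffiti$status <- ifelse(is.na(graffiti$closed), "Open", "Closed")

# Split into the two subsets used in the rest of the analysis.
closed_graffiti <- graffiti[graffiti$status == "Closed", ]
open_graffiti   <- graffiti[graffiti$status == "Open", ]
```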
Color Coded by Borough:
Manhattan: Green
The Bronx: Pink
Queens: Red
Brooklyn: Orange
Staten Island: Blue
An image of New York City
Prevalence of Graffiti in NYC
| borough | cases | population | case_per_cap | pop_per_case |
|---|---|---|---|---|
| Bronx | 3808 | 2717758 | 0.0014012 | 713.6970 |
| Brooklyn | 9832 | 4970026 | 0.0019783 | 505.4949 |
| Manhattan | 4665 | 3123068 | 0.0014937 | 669.4680 |
| Queens | 2696 | 4460101 | 0.0006045 | 1654.3401 |
| Staten Island | 299 | 912458 | 0.0003277 | 3051.6990 |
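The two derived columns in this table are simple ratios of the first two. As a sketch, they can be reproduced by typing the case counts and population figures from the table into a data frame (the name `prevalence` is an assumption for illustration):

```r
# Case counts and population figures copied from the table above.
prevalence <- data.frame(
  borough    = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
  cases      = c(3808, 9832, 4665, 2696, 299),
  population = c(2717758, 4970026, 3123068, 4460101, 912458)
)

# Reported cases per resident, and residents per reported case.
prevalence$case_per_cap <- prevalence$cases / prevalence$population
prevalence$pop_per_case <- prevalence$population / prevalence$cases

# Order boroughs from most to fewest cases per capita.
prevalence[order(-prevalence$case_per_cap), ]
```

Sorted this way, Brooklyn has the most reported cases per resident and Staten Island the fewest.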
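Addressing “Complete” Data
As mentioned above, 7,669 cases in the data set are not closed; we address those cases later in the report. For now, we look only at observations that have both an open and a closed date. The table below gives, for each borough, the mean and standard deviation of the number of days to close a case and the total number of closed cases.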
| borough | mean | sd | total |
|---|---|---|---|
| Bronx | 95.97027 | 35.11295 | 2119 |
| Brooklyn | 103.44818 | 36.15245 | 6667 |
| Manhattan | 105.43415 | 33.09512 | 3045 |
| Queens | 107.69930 | 38.66792 | 1563 |
| Staten Island | 126.26250 | 54.94030 | 240 |
Tests for Statistical Significance of Closed Data
Looking at these bar graphs, boxplot, and density plot alone, it seems that the average number of days to resolve a case is quite similar between the boroughs.
Is there a difference in the average length of time to close a graffiti case across the boroughs? We conduct an ANOVA test:
Alternative Hypothesis: At least one borough mean is different
Null Hypothesis: The means are all the same
A p-value close to 0 would be strong evidence against the null hypothesis.
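# Per-borough count, mean, and standard deviation of case length (in days)
anova_data <- closed_graffiti %>%
  group_by(borough) %>%
  summarize(count = n(),
            mean = as.numeric(mean(length)),
            sd = sd(length))
# One-way ANOVA of case length on borough
res.aov <- aov(length ~ borough, data = closed_graffiti)
summary(res.aov)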
##                Df   Sum Sq Mean Sq F value Pr(>F)
## borough         4   283272   70818   54.48 <2e-16 ***
## Residuals   13629 17714759    1300
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(res.aov)[[1]][["Pr(>F)"]][[1]]
## [1] 1.214099e-45
The resulting p-value from the ANOVA test is 1.214099e-45. We therefore reject the null hypothesis: there is very strong evidence that the mean time to close a case is not the same in all five boroughs.
The ANOVA does not tell us which boroughs differ from each other. To check whether the difference between specific pairs of boroughs is statistically significant, we use t-tests.
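As a cross-check, the F statistic can be approximately rebuilt from the per-borough means, standard deviations, and counts in the summary table, without the raw data. The sketch below types those rounded values in by hand, so small rounding differences from the `aov` output are expected. The name `stats` and the other variable names are assumptions for illustration.

```r
# Summary statistics copied (rounded) from the closed-case table.
stats <- data.frame(
  borough = c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
  mean    = c(95.97027, 103.44818, 105.43415, 107.69930, 126.26250),
  sd      = c(35.11295, 36.15245, 33.09512, 38.66792, 54.94030),
  n       = c(2119, 6667, 3045, 1563, 240)
)

N <- sum(stats$n)    # total number of closed cases
k <- nrow(stats)     # number of groups (boroughs)

# Weighted grand mean across all closed cases.
grand_mean <- sum(stats$n * stats$mean) / N

# Between-group and within-group sums of squares.
ss_between <- sum(stats$n * (stats$mean - grand_mean)^2)
ss_within  <- sum((stats$n - 1) * stats$sd^2)

# F statistic and its upper-tail p-value.
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)
```

The degrees of freedom come out as 4 and 13629, matching the ANOVA table, and `f_stat` should land close to the reported 54.48.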
# Compare two group means using only their summary statistics (unequal variances).
# Returns the lower-tail probability P(T <= t) under the null hypothesis, using the
# conservative degrees of freedom min(total1, total2) - 1. This is a one-sided value:
# for a two-sided test, double the smaller tail, i.e. 2 * pt(-abs(t_stat), df).
t_test <- function(mean1, sd1, total1, mean2, sd2, total2) {
  # Test statistic: difference in means divided by its standard error
  t_stat <- (mean1 - mean2) / sqrt(sd1^2 / total1 + sd2^2 / total2)
  df <- min(total1, total2) - 1
  p_val <- pt(t_stat, df)
  return(p_val)
}
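The function above passes the test statistic straight to `pt()`, which gives a lower-tail probability, so its result depends on the order of the arguments. A two-sided variant, sketched here with the same conservative degrees of freedom, doubles the smaller tail instead. The name `t_test_two_sided` is hypothetical, and the inputs are the rounded Brooklyn and Manhattan values from the summary table.

```r
# Two-sided p-value for a difference in means, from summary statistics only.
# Uses the conservative degrees of freedom min(n1, n2) - 1, as in the report.
t_test_two_sided <- function(mean1, sd1, n1, mean2, sd2, n2) {
  # Difference in means divided by its standard error (unequal variances).
  t_stat <- (mean1 - mean2) / sqrt(sd1^2 / n1 + sd2^2 / n2)
  df <- min(n1, n2) - 1
  # Probability of a statistic at least this extreme in either direction.
  2 * pt(-abs(t_stat), df)
}

# Brooklyn vs Manhattan, in both argument orders.
p_bm <- t_test_two_sided(103.44818, 36.15245, 6667, 105.43415, 33.09512, 3045)
p_mb <- t_test_two_sided(105.43415, 33.09512, 3045, 103.44818, 36.15245, 6667)
```

Both calls return the same value, about 0.0078, which is twice the lower-tail probability the one-sided function gives for Brooklyn vs Manhattan.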
Is there a difference in the average length of time to close a graffiti case in Manhattan vs Brooklyn?
Conducting a t-test results in the following:
# Ho: mu Manhattan = mu Brooklyn
# Ha: mu Manhattan =/= mu Brooklyn
t_test(Brooklyn$mean, Brooklyn$sd, Brooklyn$total,
Manhattan$mean, Manhattan$sd, Manhattan$total)
## [1] 0.003880954
The function returns a lower-tail probability of 0.0039. Because the alternative hypothesis is two-sided, the p-value is twice that, about 0.0078: if the two boroughs had the same mean, a difference at least this large would occur less than 1% of the time. Thus, there is strong evidence that the mean number of days to close a graffiti case differs between Brooklyn and Manhattan.
Is there a difference in the average length of time to close a graffiti case in Queens vs the Bronx? Conducting a t-test results in the following:
# Ho: mu Queens = mu Bronx
# Ha: mu Queens =/= mu Bronx
t_test(Bronx$mean, Bronx$sd, Bronx$total,
Queens$mean, Queens$sd, Queens$total)
## [1] 5.631018e-21
The lower-tail probability is about 5.6e-21, so even after doubling it for the two-sided alternative the p-value is effectively 0. Thus, there is very strong evidence that the mean number of days to close a graffiti case differs between Queens and the Bronx.
Addressing Missing Data
As noted in the Background, 36% of the observations are not “closed”. For many, this may be because the city has not yet had the opportunity to clean up the graffiti, but for some it may be due to administrative error (the case is closed but was not marked as such) or because the city forgot about the case.
To address this we created a new data frame, ‘open_graffiti’, which contains only the 36% of observations that are not closed. We used mutate to add a column, ‘since_open’, which is the number of days between the date the case was created and the most recent date in the data (Oct 14 2019). For some cases this value is 0 because the case was opened the day the data was published; for others it exceeds a year; the rest fall in between.
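The ‘since_open’ calculation is a date subtraction. A minimal base R sketch, using the creation dates of the first two open cases shown in the table that follows (the variable name `snapshot_date` is an assumption for illustration):

```r
# Two open cases, with creation dates taken from the report's table.
open_graffiti <- data.frame(
  address = c("114 west 14th street", "101 E 163 street"),
  created = as.Date(c("2019-03-07", "2019-02-19"))
)

# Most recent date in the data, per the report.
snapshot_date <- as.Date("2019-10-14")

# Subtracting two Date values yields a difftime in days.
open_graffiti$since_open <- snapshot_date - open_graffiti$created
```

This gives 221 and 237 days for the two cases, matching the ‘since_open’ column in the table.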
Top Six Observations on the open_graffiti Dataset
| address | borough | created | closed | status | long | lat | length | since_open |
|---|---|---|---|---|---|---|---|---|
| 114 west 14th street | Bronx | 2019-03-07 | NA | Open | NA | NA | NA | 221 days |
| 101 E 163 street | Bronx | 2019-02-19 | NA | Open | NA | NA | NA | 237 days |
| 121N CHRYSTIE STREET | Manhattan | 2019-01-17 | NA | Open | -73.99347 | 40.71886 | NA | 270 days |
| 792 ST NERI WAY | Bronx | 2019-01-07 | NA | Open | NA | NA | NA | 280 days |
| 146 rockaway ave | Brooklyn | 2018-12-18 | NA | Open | -73.91078 | 40.67811 | NA | 300 days |
| 597 39 STREET | Brooklyn | 2018-11-28 | NA | Open | -74.00207 | 40.65016 | NA | 320 days |
How do we separate cases that are still being worked on from cases where other factors might be at play?
Here we take the 95th percentile of ‘length’ from the data frame ‘closed_graffiti’. Our logic is that if the value of ‘since_open’ for an open case is greater than the 95th percentile of ‘length’ among closed cases, the case has been open unusually long, and we might suspect one or both of the explanations given above: an administrative error, or a case the city has forgotten.
## Time difference of 154 days
Of all closed observations of graffiti in NYC, 95% were closed in 154 days or less. Next, for each borough we divide the number of open cases older than 154 days (‘total_issue’) by the total number of open cases in that borough (‘total_open’). This gives five proportions, one per borough. The larger the proportion, the larger the share of that borough's open cases that are ‘very old’.
| borough | total_issue | total_open | prop |
|---|---|---|---|
| Bronx | 463 | 1689 | 0.2741267 |
| Brooklyn | 904 | 3165 | 0.2856240 |
| Manhattan | 514 | 1620 | 0.3172840 |
| Queens | 442 | 1133 | 0.3901147 |
| Staten Island | 21 | 59 | 0.3559322 |
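The flagging logic behind this table can be sketched as follows. The data here is a small made-up example, not the real data set: `closed_length` and `open_cases` are hypothetical stand-ins for the ‘length’ column of ‘closed_graffiti’ and for ‘open_graffiti’, so the threshold is not the report's 154 days.

```r
# Hypothetical closing times (days) for closed cases.
closed_length <- 1:100

# 95th percentile of closing time; this is the cutoff for a 'very old' open case.
threshold <- quantile(closed_length, probs = 0.95, names = FALSE)

# Hypothetical open cases with the number of days each has been open.
open_cases <- data.frame(
  borough    = c("Bronx", "Bronx", "Bronx", "Bronx", "Queens", "Queens", "Queens"),
  since_open = c(10, 50, 120, 200, 96, 30, 40)
)

# An 'issue' case has been open longer than 95% of closed cases took to close.
open_cases$issue <- open_cases$since_open > threshold

# Count issue cases and all open cases per borough, then take the ratio.
total_issue <- tapply(open_cases$issue, open_cases$borough, sum)
total_open  <- tapply(open_cases$issue, open_cases$borough, length)
result <- data.frame(
  borough     = names(total_open),
  total_issue = as.vector(total_issue),
  total_open  = as.vector(total_open)
)
result$prop <- result$total_issue / result$total_open
```

In the toy data, two of the four Bronx cases and one of the three Queens cases exceed the threshold.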
This graph displays those proportions. In addition, the size of each dot corresponds to the mean number of days to close a case in that borough. The lower the dot, the smaller the share of ‘issue’ cases among that borough's open cases. The smaller the dot, the faster closed cases were closed. Put simply, a small, low dot can be read as “good” and a large, high dot as “bad”.